We will nest our A1 document in this notebook to continue our analysis.
Publication Title: Transcriptomic analyses reveal rhythmic and CLOCK-driven pathways in human skeletal muscle
Publication Date: 2018-04-16
Publication Journal: eLife
GEO ID: GSE108539
The dataset has intronic and extronic regions for each sample. The groups were defined as per the following definitions [1]:
(1) Exonic: if it occurs in at least one of the transcripts
(2) Intronic: if it is shared between all the transcripts
Also, genes with less than two intronic reads or 10 exonic reads on average were discarded.
Platform Title: Illumina HiSeq 2500 (Homo sapiens)
Original submission date: Mar 14 2013
Last update date: Mar 27 2019
Organism: Homo sapiens
No. of GEO datasets that use this technology: 6080
No. of GEO samples that use this technology: 167630
We have 16093 genes expressed in the dataset. These genes are based on 114 samples from 10 study participants. Each participant was sampled at 6 times in 4 hour intervals.
The available count data on GEO was already cleaned, normalized and filtered as described in [1]. The authors of the dataset performed the following pre-processing procedure:
1. Transcripts with lower than 3 CPM were removed
2. Transcripts that were not aligned were removed
3. Scaling factors were determined using the Trimmed M-values method [2] was applied
We will first check if any genes are duplicated based on the ensembl ids provided:
We find the no duplicates, due to the fact that the data has already been cleaned.
Both box plots show us a very well normalized set of data. No outliers can be seen from this. A key difference to note here is the difference in medians between intron and exon sequences.
Again we see a similar trend as we did in the box plot. Intron sequences (green lines) are distributed evenly but are quite distinct from exon sequences (blue).
From the MDS plot, overall we see the clustering of majority of intron and exon samples. The more interesting pattern is found when observed the samples that are not clustered with the others; we see the y-axis values are similar introns and exon samples with the same participants and time. Hence, there does not seem to be any significant outliers, and we will continue downstream analysis with all samples.
Next, we will show common vs tagwise dispersion. Dispersion measures the how much our counts deviate from the mean.
This is not an expected or normal dispersion plot. Typically, the variance should decrease with increase in log CPM, but we have the opposite here. Investigation (reading paper, googling the issue, etc) to correct this was conducted but to no avail. Further investigation will be conducted to understand this.
Next, we will plot the mean and variance relationship.
In this plot, the blue line shows the Negative Binomial. The black line shows the Poisson mean-variance relationship. The best fit of the data is marked by the red crosses. The data seems to have been normalized well and there is no overdispersion.
Here we will get information on the identifier datasets and map the ensembl IDs to the current HUGO symbols:
Here we see that there is ensembl_gene_id, exactly what we are looking for. We will use this to map the ensembl ids in our dataset.
Now that we have mapped the genes to HGNC symbols, we will check if there are any ensembl ids that have duplicate HGNC symbols:
Hence, we have 3 genes that have duplicate mapping. Since this is a very small number relative to our overall database, we will not filter these out just yet.
The difference between number of mappings we found and number of genes in the normalized dataset is 148
This number however does not tell the fully story. We will merge the mappings with the normalized dataset to get a better picture of the mapping.
Based on our merged ensembl ids to HGNC symbols, we will identify the number of genes that did not match:
148 identifiers are missing.
148 out of 16093 have not been mapped. This accounts 0.92% of the genes that were not identifed. To address the unidentified genes, we will attempt to use a new tool called Biobtree [3]. This is an alternative tool to biomarRt, using a different dataset, hence we may be able to map some more unidentified genes.
What are the control and test conditions of the dataset?
The control condition was skeletal muscle biopsies collected from human samples who were placed under rigid laboratory conditions. Biopsies were taken every 4 hours across a 24 hour period from 10 individuals. The test conditions were performed on cultured human primary skeletal myotubes (hSKM). Though analyzing the cultered data analysis is beyond the scope of the course, there are several groups that the dataset can be clustered into (i.e. by demographic of timepoint) for differential analysis.
Why is the dataset of interest to you?
I have been engaged in sleep research for the past few years at the level of physiological signals. Given my knowledge and previous experience in the sleep domain, exploring a dataset with biological signals related to sleep and circadian rhythms seemed quite fitting.
Were there expression values that were not unique for specific genes? How did you handle these? There were 3 values were each mapped to 2 genes. This is an insignificant amount of replication so we will simply filter both replications of these genes out in downstream analysis.
Were there expression values that could not be mapped to current HUGO symbols?
Yes there were 148 expression values that could not be mapped to current HUGO symbols.
How many outliers were removed?
No outliers were removed. There are some potential outliers shown in the MDS plot, but seem consistent enough to consider. We will keep all samples for now as they may reveal important insight in downstream analysis.
How did you handle replicates?
There were no replicates in the dataset provided as it had already been cleaned prior to uploading to GEO.
What is the final coverage of your dataset?
(1) The dataset provided 16093 genes.
(2) However, the dataset provided was already pre-filtered according to the conditions described in the Filtering and Normalizing section.
(3) We were able to map 15945 ensembl ids to current hgnc symbols.
(4) We then identified 3 genes that were replicated, which will be filtered out downstream. Overall, we are left with 1.59410^{4} genes to work with, accounting for 99.0492761% of the genes in the dataset. Not too bad :)
[1]L. Perrin et al., “Transcriptomic analyses reveal rhythmic and CLOCK-driven pathways in human skeletal muscle,” eLife, vol. 7, p. e34114, Apr. 2018, doi: 10.7554/eLife.34114.
[2]M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data,” Bioinformatics, vol. 26, no. 1, pp. 139–140, Jan. 2010, doi: 10.1093/bioinformatics/btp616.
The dataset being used in this analysis is based on what is published in [1]. The dataset samples #### Analysis Objectives:
First we will install the necessary packages:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages(pkgs = c("BiocManager"),
repos = "http://cran.rstudio.org",
dependencies = TRUE,
quiet = TRUE)
}
if (!requireNamespace("circlize", quietly = TRUE)) {
install.packages(pkgs = c("circlize"),
repos = "http://cran.rstudio.org",
dependencies = TRUE,
quiet = TRUE)
}
if (!requireNamespace("gprofiler2", quietly = TRUE)) {
install.packages(pkgs = c("circlize"),
repos = "http://cran.rstudio.org",
dependencies = TRUE,
quiet = TRUE)
}
BiocManager::install("ComplexHeatmap")
BiocManager::install("limma")
BiocManager::install("GOstats")
library(Biobase)
library(circlize)
library(limma)
library(grid)
library(gprofiler2)
normalized_count_data <- normalized_counts_annot_filtered
First we will setup or data variables:
intron_cols <- grep("Intron.*|ensembl_gene_id|hgnc_symbol*", names(normalized_count_data), value = TRUE)
exon_cols <- grep("Exon*|ensembl_gene_id|hgnc_symbol", names(normalized_count_data), value = TRUE)
normalized_count_data_introns <- normalized_count_data[ , intron_cols]
normalized_count_data_exons <- normalized_count_data[ , exon_cols]
heatmap_matrix <- normalized_count_data[ ,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$ensemble_gene_id
colnames <- colnames(normalized_count_data[ ,3:ncol(normalized_count_data)])
heatmap_matrix_intron <- normalized_count_data_introns[ ,3:ncol(normalized_count_data_introns)]
rownames(heatmap_matrix_intron) <- normalized_count_data_introns$ensemble_gene_id
colnames <- colnames(normalized_count_data_introns[ ,3:ncol(normalized_count_data_introns)])
heatmap_matrix_exon <- normalized_count_data_exons[ ,3:ncol(normalized_count_data_exons)]
rownames(heatmap_matrix_intron) <- normalized_count_data_exons$ensemble_gene_id
colnames <- colnames(normalized_count_data_exons[ ,3:ncol(normalized_count_data_exons)])
Note: Code for Heatmap matrix for unthresholded dataset is commented out and hidden as it takes too long to execture.
gene_of_interest <- which(normalized_count_data_exons$hgnc_symbol == "CLOCK")
participants <- c("Exon_15", "Exon_3", "Exon_2","Exon_6", "Exon_7", "Exon_9", "Exon_13", "Exon_4", "Exon_12", "Exon_1")
timepoint_clock_matrix <- matrix(nrow=10, ncol=0)
row.names(timepoint_clock_matrix) <- participants
create_timepoint_matrix <- function(grep_pattern, timepoint) {
tp <- grep(colnames(normalized_count_data_exons),
pattern=grep_pattern)
m <- (t(normalized_count_data_exons
[gene_of_interest, tp]))
temp_rownames <- strsplit(rownames(m), " ")
new_rownames <- lapply(temp_rownames, function(x) {
new_name <- unlist(strsplit(x, "_"))[1:2]
new_name <- paste(new_name[1], new_name[2], sep="_")
return(new_name)
})
colnames(m) <- c(timepoint)
row.names(m) <- new_rownames
return(m)
}
t1 <- create_timepoint_matrix("Exon.*_12.00", "12.00")
t2 <- create_timepoint_matrix("Exon.*_16.00", "16.00")
t3 <- create_timepoint_matrix("Exon.*_20.00", "20.00")
t4 <- create_timepoint_matrix("Exon.*_00.00", "00.00")
t5 <- create_timepoint_matrix("Exon.*_04.00", "04.00")
t6 <- create_timepoint_matrix("Exon.*_08.00", "08.00")
matricies <- list(t1, t2, t3, t4, t5, t6)
for (l in matricies) {
timepoint_clock_matrix <- cbind(timepoint_clock_matrix, l[, 1][match(rownames(timepoint_clock_matrix), rownames(l))])
}
colnames(timepoint_clock_matrix) <- c("12.00", "16.00", "20.00", "00.00", "04.00", "08.00")
timepoint_clock_matrix
## 12.00 16.00 20.00 00.00 04.00 08.00
## Exon_15 2.018295 2.359913 2.446082 2.5795096 2.462013 0.7254062
## Exon_3 1.668517 1.909955 2.125634 1.9915728 2.019576 0.7396265
## Exon_2 1.920443 1.972585 2.356818 2.2837716 2.296801 2.0533181
## Exon_6 1.612411 1.659358 1.951415 2.0688471 1.955065 1.8176283
## Exon_7 1.798982 1.639746 2.054475 NA NA NA
## Exon_9 1.706860 1.810690 2.045504 2.2416810 1.988155 1.6351701
## Exon_13 1.796529 1.769102 1.968643 2.0990352 2.106858 1.8899544
## Exon_4 1.878297 2.103864 2.220510 2.4258696 2.149684 1.8286810
## Exon_12 1.537008 1.638882 1.976554 0.1883507 2.360112 1.7235454
## Exon_1 1.630994 1.613856 1.644382 1.4006078 1.610527 1.2192292
To compare the CLOCK gene across our timepoint groups, we will perform an ANOVA as we have more than 2 groups in the data.
df_clock_timepoint <- data.frame()
for (col in 1:ncol(timepoint_clock_matrix)) {
values <- as.data.frame(timepoint_clock_matrix[col, ])
timepoints <- rep(c(colnames(timepoint_clock_matrix)[col]), 6)
values <- cbind(values, timepoints)
colnames(values) <- c("values", "timepoints")
rownames(values) <- NULL
df_clock_timepoint <- rbind(df_clock_timepoint, values)
}
df_clock_timepoint[1:5, ]
aov_clock <- aov(df_clock_timepoint$values ~ df_clock_timepoint$timepoints)
summary(aov_clock)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_clock_timepoint$timepoints 5 0.741 0.1481 0.897 0.497
## Residuals 27 4.456 0.1651
## 3 observations deleted due to missingness
Our P value is >0.05, hence we equal variance between the timepoints.
Here we revisit the MDS plot developed in A1, but using the limma package. Again, we notice For the most part, samples in the intron and exon class cluster together with some outliers away from the 2 main clustering sites.
pat_colors <- rainbow(10)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
limma::plotMDS(heatmap_matrix_exon,
col = pat_colors)
Typically, our sampels cluster together, though we have 4 outliers to the right of the graph.
In the study related to the dataset, they are looking at which genes are expressed at the 6 different timepoints to get a sense of the genes involved in the human circadian rhythm.
Here, we will fit our data from each time point to a linear model. To do this, we will first we will segegrate our patient samples intro groups by patient and timepoint of sample.
samples <- data.frame(
lapply(colnames(normalized_count_data_exons)[3:ncol(normalized_count_data_exons)],
FUN=function(x){
unlist(strsplit(x, split = "_"))[c(2,4)]}))
colnames(samples) <- colnames(normalized_count_data_exons)[3:ncol(normalized_count_data_exons)]
rownames(samples) <- c("patient","timepoint")
samples <- data.frame(t(samples))
samples[1:5, ]
We will now create our model design. We have 6 different timepoints where data was collected, and so will we design the model to account for this.
model_design <- model.matrix(~ samples$timepoint)
model_design[1:5, 1:6]
## (Intercept) samples$timepoint04.00 samples$timepoint08.00
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## samples$timepoint12.00 samples$timepoint16.00 samples$timepoint20.00
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
Now we will our create our data matrix to apply the model to and get the P-values for our genes.
expressionMatrix <- as.matrix(normalized_count_data_exons[,3:ncol(normalized_count_data_exons)])
rownames(expressionMatrix) <- normalized_count_data_exons$ensembl_gene_id
colnames(expressionMatrix) <- colnames(normalized_count_data_exons)[3:ncol(normalized_count_data_exons)]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)
fit2 <- eBayes(fit, trend=TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc symboles to the topfit table
output_hits <- merge(normalized_count_data_exons[,1:2],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort the output lists by decreasing pvaluee
output_hits <- output_hits[order(output_hits$P.Value),]
output_hits[1:5, ]
genes_pass <- length(which(output_hits$P.Value < 0.05))
genes_pass_correction <- length(which(output_hits$adj.P.Val < 0.05))
We have 1096 genes that pass the p-value threshold of <0.05. Further, we have 1 genes that pass the corrected P value threshold.
We have a sizeable amount of genes that pass the p-value threshold of 0.05, so we will represent these genes that passed through an MA plot and a heatmap.
xlim <- c(-3,1)
ylim <- c(-4,4)
passed_genes <- topfit[which(topfit$P.Value<0.05), ]
not_passed_genes <- topfit[which(topfit$P.Value>=0.05), ]
limma::plotMA(topfit, main="MA Plot: Passed Threshold Genes", ylim=ylim, xlim=xlim)
limma::plotMA(not_passed_genes, main="MA Plot: Other Genes", ylim=ylim, xlim=xlim)
As we can see the difference in the 2 plots, the p-value of 0.05 filters out a significant number of outliers.
We will now create a heatmap based on the thresholded gene list.
heatmap_matrix_new <- normalized_count_data_exons[,3:ncol(normalized_count_data_exons)]
rownames(heatmap_matrix_new) <- normalized_count_data_exons$ensembl_gene_id
colnames(heatmap_matrix_new) <- colnames(normalized_count_data_exons[,3:ncol(normalized_count_data_exons)])
heatmap_matrix_new <- t(scale(t(heatmap_matrix_new)))
top_hits <- output_hits$ensembl_gene_id[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix_new[
which(rownames(heatmap_matrix_new) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = circlize::colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = FALSE,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_title = "Heatmap of top gene hits - Limma Model",
column_names_gp = gpar(fontsize = 8)
)
current_heatmap
We see significant more expression on the left and very right ends of this heatp. Looking at these time points, they are typically 0:00, 4:00, and 8:00, times that these participants were likely asleep.
Now we will continue our differential analysis to see differential expression using a different model, the Quasi liklihood model.
First we will get our filtered count data from A1 and create our Quasi liklihood model and show the head of the model.
exon_data_matrix <- inverse_data_matrix[ , grepl("Exon.*", colnames(inverse_data_matrix))]
d = DGEList(counts=exon_data_matrix, group=samples$timepoint)
d <- estimateDisp(d, model_design)
fit <- glmQLFit(d, model_design)
model_design[1:10, 1:5]
## (Intercept) samples$timepoint04.00 samples$timepoint08.00
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## 6 1 0 1
## 7 1 0 0
## 8 1 0 0
## 9 1 0 0
## 10 1 0 0
## samples$timepoint12.00 samples$timepoint16.00
## 1 1 0
## 2 0 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 1 0
## 8 0 1
## 9 0 0
## 10 0 0
qlf.pos_vs_neg <- glmQLFTest(fit)
topTags(qlf.pos_vs_neg)
## Coefficient: samples$timepoint20.00
## logFC logCPM F PValue FDR
## ENSG00000169429 -7.908888 6.134667 39.07578 6.733919e-08 0.001083690
## ENSG00000119508 -4.092447 4.686381 36.61462 1.413032e-07 0.001136996
## ENSG00000125740 -5.695278 4.785103 34.88074 2.412801e-07 0.001294307
## ENSG00000158050 -3.994121 4.387527 32.76261 4.709932e-07 0.001727587
## ENSG00000120738 -4.676144 6.104781 32.35508 5.367511e-07 0.001727587
## ENSG00000122877 -4.652983 4.130170 31.52152 7.026935e-07 0.001884741
## ENSG00000271762 -3.912207 5.534392 30.56867 9.594209e-07 0.002000830
## ENSG00000144655 -3.281010 5.589754 30.37537 1.022459e-06 0.002000830
## ENSG00000248323 -4.051097 5.273990 30.10221 1.118963e-06 0.002000830
## ENSG00000081041 -3.327333 4.934630 29.46007 1.384968e-06 0.002228829
Next, we will look at how many genes pass the thresholds with the Quasi.
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(normalized_count_data))
quasi_pvalue_pass <- length(which(qlf_output_hits$table$PValue < 0.05))
quasi_corrected_pass <- length(which(qlf_output_hits$table$FDR < 0.05))
With the Quasi-liklihood model, 939 genes pass the threshold p-value <0.05 and 297 pass the correction.
Next we will compare the different models we generated (Limma vs Quasi liklihood).
qlf_pat_model_pvalues <- data.frame(
ensembl_id = rownames(qlf_output_hits$table),
qlf_patient_pvalue=qlf_output_hits$table$PValue)
limma_pat_model_pvalues <- data.frame(
ensembl_id = output_hits$ensembl_gene_id,
limma_patient_pvalue = output_hits$P.Value)
two_models_pvalues <- merge(qlf_pat_model_pvalues,
limma_pat_model_pvalues,
by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$limma_patient_pvalue<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue<0.05 & two_models_pvalues$limma_patient_pvalue<0.05] <- "red"
plot(two_models_pvalues$qlf_patient_pvalue,
two_models_pvalues$limma_patient_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF patient model p-values",
ylab ="Limma Patient model p-values",
main="QLF vs Limma")
As we can see, there is a small overlap between the two models, shown in red. Blue points represent genes overexpressed only in the Quasi model and yellow represent the genes overpressed only in the limma model. We will highlight some genes of interest, known to contribute to circadian rythym as described in [1].
ensembl_of_interest <- normalized_count_data_exons$ensembl_gene_id[
which(normalized_count_data_exons$hgnc_symbol == "CLOCK")]
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$ensembl_id==ensembl_of_interest] <- "red"
plot(two_models_pvalues$qlf_patient_pvalue,
two_models_pvalues$limma_patient_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF patient model p-values",
ylab ="Limma Patient model p-values",
main="QLF vs Limma")
points(two_models_pvalues[
two_models_pvalues$ensembl_id==ensembl_of_interest,2:3],
pch=24, col="red", cex=1.5)
The red triangles are our genes of interest. We can see that our genes of interest strong correlation of expression in both models.
Now we will generate a heatmap for the Quasi Likelihood model.
top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix_new[which(rownames(heatmap_matrix_new) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = circlize::colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_title = "Heatmap of top gene hits - Quasi liklihood Model",
column_names_gp = gpar(fontsize = 8)
)
current_heatmap
Compared to the limma heatmap, we see signficantly less differentiation here. However, the pattern of more differentation at timepoints 00:00 and 04:00 (times where participants are likely asleep) remains consistent between the two models.
qlf_output_hits_withgn <- merge(normalized_count_data[,1:2],qlf_output_hits, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
n_upreg_genes <- length(upregulated_genes)
n_downreg_genes <- length(downregulated_genes)
We have identified 6 upregulated genes and 925 downregulated genes to be used in our gene enrichment analysis.
We will use the gprofiler package to a do a preliminary gene set analysis. We will perform a query with our upregulated genes and then our downregulated genes.
gostres_upreg <- gprofiler2::gost(upregulated_genes, organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = TRUE,
measure_underrepresentation = FALSE, evcodes = FALSE,
user_threshold = 0.05, correction_method = c("g_SCS", "bonferroni",
"fdr", "false_discovery_rate", "gSCS", "analytical"),
domain_scope = c("annotated"), custom_bg = NULL,
numeric_ns = "", sources = NULL)
gostres_downreg <- gprofiler2::gost(downregulated_genes, organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = TRUE,
measure_underrepresentation = FALSE, evcodes = FALSE,
user_threshold = 0.05, correction_method = c("g_SCS", "bonferroni",
"fdr", "false_discovery_rate", "gSCS", "analytical"),
domain_scope = c("annotated"), custom_bg = NULL,
numeric_ns = "", sources = NULL)
gostres_upreg$result
gprofiler2::gostplot(gostres_upreg, capped = TRUE, interactive = TRUE, pal = c(`GO:MF`
= "#dc3912", `GO:BP` = "#ff9900", `GO:CC` = "#109618", KEGG = "#dd4477",
REAC = "#3366cc", WP = "#0099c6", TF = "#5574a6", MIRNA = "#22aa99", HPA
= "#6633cc", CORUM = "#66aa00", HP = "#990099"))
gprofiler2::gostplot(gostres_downreg, capped = TRUE, interactive = TRUE, pal = c(`GO:MF`
= "#dc3912", `GO:BP` = "#ff9900", `GO:CC` = "#109618", KEGG = "#dd4477",
REAC = "#3366cc", WP = "#0099c6", TF = "#5574a6", MIRNA = "#22aa99", HPA
= "#6633cc", CORUM = "#66aa00", HP = "#990099"))
These two plots show us the gene sets that our upregulated and downregulated genes belong to. There are 11 gene sets for both the upregulated and downregulated genes. There is noticable different in the number of genes for upregulated vs. downregulated. For example, HPA and CORUM sets have significantly more genes in upregulated.
[1] L. Perrin et al., “Transcriptomic analyses reveal rhythmic and CLOCK-driven pathways in human skeletal muscle,” eLife, vol. 7, p. e34114, Apr. 2018, doi: 10.7554/eLife.34114.
[2] U. Raudvere et al., “g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update),” Nucleic Acids Research, vol. 47, no. W1, pp. W191–W198, Jul. 2019, doi: 10.1093/nar/gkz369.